Allow custom check messages #1092
Conversation
Co-authored-by: Isaac
✅ 738/738 passed, 41 skipped, 4h43m32s total. Running from acceptance #4562
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1092      +/-   ##
==========================================
- Coverage   92.06%   91.13%   -0.94%
==========================================
  Files         101      101
  Lines        9546     9557      +11
==========================================
- Hits         8789     8710      -79
- Misses        757      847      +90
```
…eholders
The custom-message field is renamed to message_expr and now accepts either a
Spark SQL expression string or a Spark Column object. The expression is
evaluated as-is — DQX no longer substitutes {rule_name}, {check_func_name}, or
{column_value} placeholders. Callers reference columns and identifiers directly
inside the expression.
- DQRule / DQForEachColRule: rename message -> message_expr (str | Column | None)
- to_dict serialises message_expr only when it is a string (Column is in-process)
- DQRuleManager._build_message_col simplified: F.expr(str) or use Column directly
- checks_serializer: read message_expr key from metadata, pass to constructors
- Unit tests rewritten for the new attribute name and Column variant
- Integration tests rewritten without placeholders; the existing
test_apply_checks_without_custom_message_unchanged is dropped (the Engine's
no-message default path is already covered by surrounding tests)
- Docs (quality_checks.mdx) rewritten to drop the placeholder section, add a
Column-based Python example, and annotate each example with the rendered
message string
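A minimal usage sketch of the renamed field, assuming the `DQRowRule` constructor described above (import paths follow DQX's documented layout and are illustrative):

```python
from pyspark.sql import functions as F
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.check_funcs import is_not_null

# String form: parsed with F.expr, so literal text needs SQL quotes.
rule_sql = DQRowRule(
    check_func=is_not_null,
    column="email",
    message_expr="concat('email missing for id=', cast(id as string))",
)

# Column form: built in-process; to_dict() will not serialise it.
rule_col = DQRowRule(
    check_func=is_not_null,
    column="email",
    message_expr=F.lit("Email must not be null"),
)
```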
Co-authored-by: Isaac
mwojtyczka left a comment
Small, well-isolated change with good happy-path test coverage. Posting 11 inline comments below covering: a security exposure (arbitrary F.expr evaluation of strings sourced from YAML/metadata), a UX trap with unquoted strings, several maintainability/polish items, and one test bug.
One non-inline issue: PR description doesn't match the implementation.
The description says:
> Add an optional `message` callable parameter ... that allows users to define custom check failure messages. The callable receives rule context (`rule_name`, `check_func_name`, `check_func_args`, `column_value`) and returns a Spark Column expression…

But the actual API is `message_expr: str | Column | None`: no callable, no rule context. Either the description should be rewritten to match what shipped, or the implementation should provide the callable form the description promises. Reviewers (and future readers running git log / GitHub PR archeology) rely on the description for context.
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
```
Arbitrary Spark SQL evaluation of strings from non-trusted sources.
F.expr(self.check.message_expr) runs raw SQL with no validation. When checks come from YAML files, table-stored checks, or any persisted metadata source, this becomes an arbitrary-expression evaluation surface:
```yaml
- check: { function: is_not_null, arguments: { column: id } }
  message_expr: "concat('hi ', (select secret_key from secrets.config limit 1))"
```

Not a classic SQL injection (Spark `expr` doesn't allow DDL/DML), but it does allow:
- Subqueries against any catalog the running session can read.
- Calls to potentially expensive UDFs (`reflect`, `java_method`, custom Spark UDFs).
- Information disclosure via embedded subqueries that surface secrets in the violation message.
The project's CLAUDE.md security policy states:

> User-provided or templated SQL must be validated with `is_sql_query_safe()` from `utils.py` before execution. Raise `UnsafeSqlQueryError` when the query is unsafe.
Options:

- Validate `message_expr` strings via `is_sql_query_safe()` (or a stricter subset that disallows subqueries) before passing to `F.expr` (see the sketch below).
- Restrict the string form to literal/`concat`/`coalesce` patterns only.
- At minimum, document `message_expr` strings as trusted input so metadata sources validate before storage.
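A minimal sketch of the first option, assuming `is_sql_query_safe()` and `UnsafeSqlQueryError` exist as the quoted policy describes (both import paths are guesses, not verified against the codebase):

```python
from pyspark.sql import Column, functions as F

from databricks.labs.dqx.utils import is_sql_query_safe  # assumed location
from databricks.labs.dqx.errors import UnsafeSqlQueryError  # assumed location

def build_message_col(message_expr: str | Column) -> Column:
    """Validate string expressions before they reach F.expr."""
    if isinstance(message_expr, str):
        if not is_sql_query_safe(message_expr):
            raise UnsafeSqlQueryError(f"Unsafe message_expr: {message_expr!r}")
        return F.expr(message_expr)
    return message_expr
```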
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
```
UX trap: a plain string passes through F.expr and fails confusingly.
A new user will write:
```python
DQRowRule(..., message_expr="Email must not be null")
```

`F.expr("Email must not be null")` parses this as `Email` (column ref) `must` (column ref) `not be null` and fails with a confusing column-not-found error far from the rule definition.
The docs example correctly shows "'Email must not be null'" (Python double-quote wrapping a SQL single-quoted literal), but that's an awkward shape and easy to forget.
Options:

- Auto-detect plain strings (no parens, no SQL operators) and wrap with `F.lit` automatically (see the sketch below).
- Best-effort validate at construction time (`__post_init__` tries `F.expr` once and surfaces a clear error mentioning the SQL-quote requirement).
- At minimum, expand the docs Admonition with a "common mistake" callout showing the unquoted form failing.
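An illustrative sketch of the auto-detect option; the character-set heuristic is invented here, and a real implementation would need a tighter definition of "plain string":

```python
from pyspark.sql import Column, functions as F

# Crude heuristic: any of these characters suggests a SQL expression.
_SQL_HINT_CHARS = set("()'\"`")

def to_message_col(message_expr: str | Column) -> Column:
    if isinstance(message_expr, Column):
        return message_expr
    if not _SQL_HINT_CHARS.intersection(message_expr):
        # Plain prose like "Email must not be null": treat as a literal.
        return F.lit(message_expr)
    return F.expr(message_expr)  # otherwise parse as a Spark SQL expression
```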
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
return F.when(condition.isNotNull(), custom_message).otherwise(F.lit(None).cast("string"))
```
No length cap on rendered message.
A pathological expression like concat(repeat('x', 100000)) produces a 100k-char message embedded in every result row's struct — bloating Delta files and breaking downstream consumers that assume short messages. Other parts of DQX cap at 500 chars (_LLM_FIELD_MAX_LEN in the anomaly explainer).
Consider wrapping with substring(message, 1, MAX_MESSAGE_LEN) here, or applying the cap as a Spark expression on custom_message before the F.when.
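A sketch of the cap; `MAX_MESSAGE_LEN` is an assumed constant (500 here, mirroring `_LLM_FIELD_MAX_LEN`), not an existing DQX name:

```python
from pyspark.sql import Column, functions as F

MAX_MESSAGE_LEN = 500  # assumed value, mirroring _LLM_FIELD_MAX_LEN

def capped_message(condition: Column, custom_message: Column) -> Column:
    # Truncate before embedding the message in the result struct so a
    # pathological expression cannot bloat every violation row.
    capped = F.substring(custom_message, 1, MAX_MESSAGE_LEN)
    return F.when(condition.isNotNull(), capped).otherwise(F.lit(None).cast("string"))
```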
```python
user_metadata: dict[str, str] | None = None
message_expr: str | Column | None = None

def __post_init__(self):
```
Add best-effort message_expr validation in __post_init__.
A typo'd SQL string ("concat(invalid") currently fails only when the rule is applied to a DataFrame, far from where it was defined. A best-effort validation here surfaces the error at the call site:
```python
if isinstance(self.message_expr, str) and self.message_expr:
    try:
        F.expr(self.message_expr)
    except Exception as exc:
        raise InvalidParameterError(
            f"message_expr is not a valid Spark SQL expression: {exc}. "
            f"Note: literal strings must be wrapped in SQL quotes, e.g. \"'my message'\"."
        ) from exc
```

Cheap (build-only, no DataFrame eval), and the error message hints at the unquoted-string trap in the previous comment.
```python
check_func_args: list[Any] = field(default_factory=list)
check_func_kwargs: dict[str, Any] = field(default_factory=dict)
user_metadata: dict[str, str] | None = None
message_expr: str | Column | None = None
```
Overloaded parameter type — consider splitting before this becomes public API.
`message_expr: str | Column` forces callers (and serialisers, and future API extensions) to runtime-dispatch on type. Cleaner alternatives:

- `message_sql: str | None` and `message_col: Column | None` with mutual-exclusion validation.
- A single `message: Column | None` plus a `message_from_sql(s: str) -> Column` helper that wraps `F.expr` with the validation/sanitisation from the prior comments (sketch below).
Not a blocker — your current shape works — but worth deciding now since changing the field name later is a breaking change.
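A sketch of the second alternative; `message_from_sql` is a hypothetical helper, not existing DQX API:

```python
from pyspark.sql import Column, functions as F

def message_from_sql(s: str) -> Column:
    # Single choke-point for the validation/sanitisation discussed in the
    # earlier comments (e.g. is_sql_query_safe) before the string hits F.expr.
    return F.expr(s)

# Callers then always pass a Column:
# rule = DQRowRule(..., message=message_from_sql("concat('id=', cast(id as string))"))
```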
```
* *check_func_args* (optional) - Positional arguments for the check function (excluding *column*).
* *check_func_kwargs* (optional) - Keyword arguments for the check function (excluding *column*).
* *user_metadata* (optional) - User-defined key-value pairs added to metadata generated by the check.
* *message_expr* (optional) - User-defined expression used as the check failure message. Accepts either
```
No placeholder support for rule_name / check_func_name / column_value — document the limitation.
The PR description lists these as expected context, but the implementation evaluates the expression as-is. Users who want a message like "<rule_name> failed for value <column_value>" have to hand-type the rule name twice and inline the column reference (and the column reference doesn't generalise across DQForEachColRule instances).
This will be a frequent feature request as a follow-up. Worth either:

- Documenting the limitation explicitly in the docstring ("placeholder substitution is not supported; reference column names directly via Spark SQL"); today's hand-typed workaround is sketched below.
- Considering whether to add `{rule_name}` / `{column_value}` placeholder substitution in this PR or as a tracked follow-up. If you choose to defer, link to a follow-up issue here so users searching the codebase find it.
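What the hand-typed workaround looks like today, as an illustrative sketch (the rule name, column, and import paths are invented for the example):

```python
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.check_funcs import is_not_null

# The rule name and column reference are both repeated by hand in the expression.
rule = DQRowRule(
    name="id_is_not_null",
    check_func=is_not_null,
    column="id",
    message_expr="concat('id_is_not_null failed for value ', coalesce(cast(id as string), 'NULL'))",
)
```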
```python
# Only string expressions can be round-tripped through metadata; Column objects are
# in-process Spark expressions with no canonical YAML/JSON representation.
if isinstance(self.message_expr, str) and self.message_expr:
    metadata["message_expr"] = self.message_expr
```
to_dict() silently drops Column-typed message_expr — users won't notice.
A user who passes message_expr=F.concat(...) and calls to_dict() to persist their rules to YAML/Delta loses the message with no signal. They'll discover it only when re-loading and seeing the default message instead.
Options:

- Log a WARNING when a Column-typed `message_expr` is encountered during serialisation: "Column message_expr cannot be serialised; falling back to default message on round-trip." (See the sketch below.)
- Or raise `InvalidParameterError("Column-typed message_expr cannot be serialised; pass a SQL expression string if you need round-trip persistence")`.
Logging is friendlier; raising forces the user to make an explicit choice.
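A sketch of the logging option, mirroring the `to_dict()` branch shown above (the standalone function shape is an assumption, not the shipped code):

```python
import logging

from pyspark.sql import Column

logger = logging.getLogger(__name__)

def serialise_message_expr(message_expr, metadata: dict) -> None:
    if isinstance(message_expr, str) and message_expr:
        metadata["message_expr"] = message_expr
    elif isinstance(message_expr, Column):
        logger.warning(
            "Column message_expr cannot be serialised; "
            "falling back to default message on round-trip."
        )
```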
```python
check_func_args: list[Any] = field(default_factory=list)
check_func_kwargs: dict[str, Any] = field(default_factory=dict)
user_metadata: dict[str, str] | None = None
message_expr: str | Column | None = None
```
DQForEachColRule.message_expr is shared across all generated rules — flag in the field comment.
The integration test correctly notes this ("The same expression is used for every generated rule. To produce a per-column message, reference each column inline"), but the field itself has no docstring or warning. A user expecting per-column message substitution will be surprised when all generated rules emit identical messages — and won't know the workaround.
Add a one-line comment on the field describing the shared-expression behaviour, ideally with the inline-reference workaround pointer:
```python
# Shared across every generated rule; reference column names inline (e.g. "concat('a=', cast(a as string))")
# if you need per-column messages, or construct rules individually instead of via DQForEachColRule.
message_expr: str | Column | None = None
```
```python
def test_dq_dataset_rule_message_expr_defaults_to_none():
    """DQDatasetRule without message_expr should default to None."""
    rule = DQRowRule(check_func=is_not_null_and_not_empty, column="id")
```
Test bug: this test claims DQDatasetRule but constructs DQRowRule.
```python
def test_dq_dataset_rule_message_expr_defaults_to_none():
    """DQDatasetRule without message_expr should default to None."""
    rule = DQRowRule(check_func=is_not_null_and_not_empty, column="id")
    #      ^^^^^^^^^ should be DQDatasetRule with a dataset-level check function
    assert rule.message_expr is None
```

The test name and docstring claim `DQDatasetRule`, but the body uses `DQRowRule`. Either rename the test (it's now a duplicate of `test_dq_row_rule_message_expr_defaults_to_none` above), or fix the body to actually construct a `DQDatasetRule` with a `@register_rule("dataset")` check function, at which point the test exercises something new; a sketch follows.
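A sketch of the second option; `is_unique` is assumed here to be a dataset-level (`@register_rule("dataset")`) check, so substitute whichever dataset check the suite already imports:

```python
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx.check_funcs import is_unique  # assumed dataset-level check

def test_dq_dataset_rule_message_expr_defaults_to_none():
    """DQDatasetRule without message_expr should default to None."""
    rule = DQDatasetRule(check_func=is_unique, columns=["id"])
    assert rule.message_expr is None
```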
```python
    message_expr=F.lit("Custom error"),
)
rule_dict = rule.to_dict()
assert "message_expr" not in rule_dict
```
Add error-path and round-trip tests.
Current unit tests cover construction + to_dict() happy paths. Two missing categories:
1. Error paths: what happens when `message_expr` is malformed?

```python
def test_apply_checks_invalid_message_expr_fails_clearly(spark):
    rule = DQRowRule(
        check_func=is_not_null,
        column="id",
        message_expr="concat(broken syntax",
    )
    df = spark.createDataFrame([(None,)], "id: int")
    with pytest.raises(Exception):
        DQEngine(...).apply_checks(df, [rule])
```

Locks in the failure mode and gives users a debugging anchor.

2. Explicit round-trip (implicit only today).

```python
def test_string_message_expr_round_trips_through_metadata():
    rule = DQRowRule(..., message_expr="'static msg'")
    serialized = rule.to_dict()
    # ... deserialize, apply, assert message preserved.

def test_column_message_expr_lost_on_round_trip():
    rule = DQRowRule(..., message_expr=F.lit("col msg"))
    serialized = rule.to_dict()
    assert "message_expr" not in serialized  # documented behaviour
```
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
```
Check whether it is a string (rather than checking for Column), to avoid problems with the classic Spark Column class vs the Spark Connect Column class.
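A sketch of the suggested inversion (the standalone function is illustrative; in the PR the logic sits inline in `_build_message_col`):

```python
from pyspark.sql import functions as F

def resolve_message(message_expr):
    # Dispatch on str: any non-string value (a classic Column or a Spark
    # Connect Column) passes through untouched, so no isinstance check
    # against a specific Column class is needed.
    return F.expr(message_expr) if isinstance(message_expr, str) else message_expr
```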
Changes

Add an optional `message` callable parameter to `DQRule`, `DQRowRule`, `DQDatasetRule`, and `DQForEachColRule` that allows users to define custom check failure messages. The callable receives rule context (`rule_name`, `check_func_name`, `check_func_args`, `column_value`) and returns a Spark Column expression, enabling dynamic messages that can include column values.

When `message` is `None` (the default), the existing auto-generated message behavior is preserved. When provided, the custom message replaces the default message for failed rows while maintaining null/non-null semantics for passing rows.

Linked issues
Resolves #958
Tests